A Stochastic Penalty Model for Convex and Nonconvex Optimization with Big Constraints
The last decade witnessed a rise in the importance of supervised learning
applications involving {\em big data} and {\em big models}. Big data refers to
situations where the amounts of training data available and needed cause
difficulties in the training phase of the pipeline. Big model refers to
situations where large dimensional and over-parameterized models are needed for
the application at hand. Both of these phenomena have led to a dramatic increase in
research activity aimed at taming the issues via the design of new
sophisticated optimization algorithms. In this paper we turn attention to the
{\em big constraints} scenario and argue that elaborate machine learning
systems of the future will necessarily need to account for a large number of
real-world constraints, which will need to be incorporated in the training
process. This line of work is largely unexplored, and provides ample
opportunities for future work and applications. To handle the {\em big
constraints} regime, we propose a {\em stochastic penalty} formulation which
{\em reduces the problem to the well understood big data regime}. Our
formulation has many interesting properties which relate it to the original
problem in various ways, with mathematical guarantees. We give a number of
results specialized to nonconvex loss functions, smooth convex functions,
strongly convex functions and convex constraints. We show through experiments
that our approach can beat competing approaches by several orders of magnitude
when a medium-accuracy solution is required.
On Optimal Probabilities in Stochastic Coordinate Descent Methods
We propose and analyze a new parallel coordinate descent method, NSync, in
which at each iteration a random subset of coordinates is updated, in parallel,
allowing for the subsets to be chosen non-uniformly. We derive convergence
rates under a strong convexity assumption, and comment on how to assign
probabilities to the sets to optimize the bound. In both theoretical complexity
and practical performance, the method can outperform its uniform variant by an
order of magnitude. Surprisingly, the strategy of updating a single randomly
selected coordinate per iteration, with optimal probabilities, may require
fewer iterations, both in theory and practice, than the strategy of updating all
coordinates at every iteration.
Comment: 5 pages, 1 algorithm (NSync), 2 theorems, 2 figures
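As a rough illustration of non-uniform coordinate sampling, the sketch below runs a serial coordinate descent (one coordinate per step) on a small strongly convex quadratic. The function name `coordinate_descent`, the example matrix, and the choice of probabilities proportional to the diagonal entries are assumptions made for this example. It is not the parallel NSync method itself.

```python
import random

def coordinate_descent(A, b, probs, iters, seed=0):
    """Minimize 0.5 * x^T A x - b^T x, updating one randomly chosen coordinate per step.

    A is a symmetric positive definite matrix (list of lists), and probs holds
    the (possibly non-uniform) sampling weights of the coordinates.
    """
    rng = random.Random(seed)
    n = len(b)
    x = [0.0] * n
    coords = list(range(n))
    for _ in range(iters):
        # Sample a coordinate according to the given weights.
        i = rng.choices(coords, weights=probs)[0]
        # Partial derivative of the quadratic with respect to x[i].
        grad_i = sum(A[i][j] * x[j] for j in range(n)) - b[i]
        # Step of size 1 / A[i][i], the coordinate-wise Lipschitz constant.
        x[i] -= grad_i / A[i][i]
    return x

# Hypothetical example problem (not taken from the paper).
A = [[100.0, 1.0, 0.0],
     [1.0, 10.0, 1.0],
     [0.0, 1.0, 1.0]]
b = [1.0, 2.0, 3.0]
x_uniform = coordinate_descent(A, b, probs=[1.0, 1.0, 1.0], iters=5000)
x_weighted = coordinate_descent(A, b, probs=[100.0, 10.0, 1.0], iters=5000)
```

Changing `probs` is the only difference between the uniform and non-uniform runs, which makes it easy to compare sampling strategies on a given problem.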
Linearly convergent stochastic heavy ball method for minimizing generalization error
In this work we establish the first linear convergence result for the
stochastic heavy ball method. The method performs SGD steps with a fixed
stepsize, amended by a heavy ball momentum term. In the analysis, we focus on
minimizing the expected loss, which is typically a much harder problem than
finite-sum minimization. While in the analysis we constrain ourselves to
quadratic loss, the overall objective is not necessarily strongly convex.
Comment: NIPS 2017, Workshop on Optimization for Machine Learning (camera
ready version)
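The update described above (an SGD step with a fixed stepsize plus a heavy ball momentum term) can be sketched on a small consistent least-squares problem. The function name, the per-row normalization of the loss, and the parameter values are assumptions of this sketch and are not taken from the paper.

```python
import random

def stochastic_heavy_ball(rows, targets, stepsize=1.0, momentum=0.3,
                          iters=2000, seed=0):
    """SGD with heavy ball momentum on f_i(x) = (a_i . x - b_i)^2 / (2 * ||a_i||^2)."""
    rng = random.Random(seed)
    d = len(rows[0])
    x_prev = [0.0] * d
    x = [0.0] * d
    for _ in range(iters):
        i = rng.randrange(len(rows))
        a = rows[i]
        residual = sum(aj * xj for aj, xj in zip(a, x)) - targets[i]
        # Stochastic gradient of the normalized quadratic loss, scaled by the stepsize.
        scale = stepsize * residual / sum(aj * aj for aj in a)
        # Gradient step from x plus the momentum term momentum * (x - x_prev).
        x_next = [xj - scale * aj + momentum * (xj - pj)
                  for xj, aj, pj in zip(x, a, x_prev)]
        x_prev, x = x, x_next
    return x

# Hypothetical data, consistent with the solution x = [2, -1].
rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
targets = [2.0, -1.0, 1.0]
x_hat = stochastic_heavy_ball(rows, targets)
```

Setting `momentum=0.0` recovers plain SGD with a fixed stepsize on the same loss.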
Semi-Stochastic Gradient Descent Methods
In this paper we study the problem of minimizing the average of a large
number ($n$) of smooth convex loss functions. We propose a new method, S2GD
(Semi-Stochastic Gradient Descent), which runs for one or several epochs in
each of which a single full gradient and a random number of stochastic
gradients are computed, following a geometric law. The total work needed for the
method to output an $\varepsilon$-accurate solution in expectation, measured in
the number of passes over data, or equivalently, in units equivalent to the
computation of a single gradient of the loss, is
$O((\kappa/n)\log(1/\varepsilon))$, where $\kappa$ is the condition number.
This is achieved by running the method for $O(\log(1/\varepsilon))$ epochs,
with a single gradient evaluation and $O(\kappa)$ stochastic gradient
evaluations in each. The SVRG method of Johnson and Zhang arises as a special
case. If our method is limited to a single epoch only, it needs to evaluate at
most $O((\kappa/\varepsilon)\log(1/\varepsilon))$ stochastic gradients. In
contrast, SVRG requires $O(\kappa/\varepsilon^2)$ stochastic gradients. To
illustrate our theoretical results, S2GD only needs the workload equivalent to
about 2.1 full gradient evaluations to find a $10^{-6}$-accurate solution for
a problem with $n=10^9$ and $\kappa=10^3$.
Comment: 19 pages, 3 figures, 2 algorithms, 3 tables
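The epoch structure described above (one full gradient, then a random number of stochastic steps drawn from a geometric law) might look as follows in a minimal sketch. The function names, the least-squares example, and all parameter values are assumptions chosen for illustration. The parameter `nu` plays the role of a lower bound on the strong convexity constant, and `nu=0` makes the law uniform.

```python
import random

def s2gd(grad_i, n, x0, stepsize, max_inner, epochs, nu=0.0, seed=0):
    """Semi-stochastic gradient descent sketch for (1/n) * sum_i f_i(x)."""
    rng = random.Random(seed)
    x = list(x0)
    lengths = list(range(1, max_inner + 1))
    # Geometric law over the inner-loop length t; uniform when nu == 0.
    weights = [(1.0 - nu * stepsize) ** (max_inner - t) for t in lengths]
    for _ in range(epochs):
        # One full gradient per epoch, evaluated at the epoch's starting point x.
        full = [0.0] * len(x)
        for i in range(n):
            full = [f + g / n for f, g in zip(full, grad_i(i, x))]
        t = rng.choices(lengths, weights=weights)[0]
        y = list(x)
        for _ in range(t):
            i = rng.randrange(n)
            g_y, g_x = grad_i(i, y), grad_i(i, x)
            # Variance-reduced direction: full gradient corrected by one sample.
            y = [yj - stepsize * (fj + a - b)
                 for yj, fj, a, b in zip(y, full, g_y, g_x)]
        x = y
    return x

def least_squares_grad(rows, targets):
    """Return grad_i for f_i(x) = 0.5 * (rows[i] . x - targets[i])**2."""
    def grad_i(i, x):
        r = sum(a * xj for a, xj in zip(rows[i], x)) - targets[i]
        return [r * a for a in rows[i]]
    return grad_i

# Hypothetical data, consistent with the solution x = [1, 2].
rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
targets = [1.0, 2.0, 3.0, -1.0]
x_hat = s2gd(least_squares_grad(rows, targets), n=4, x0=[0.0, 0.0],
             stepsize=0.1, max_inner=100, epochs=60, nu=0.75)
```

Each epoch costs one pass over the data for the full gradient plus two stochastic gradient evaluations per inner step, which is the accounting the complexity statements in the abstract rely on.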
Accelerated Gossip via Stochastic Heavy Ball Method
In this paper we show how the stochastic heavy ball method (SHB), a popular
method for solving stochastic convex and non-convex optimization problems,
operates as a randomized gossip algorithm. In particular, we focus on two
special cases of SHB: the Randomized Kaczmarz method with momentum and its
block variant. Building upon a recent framework for the design and analysis of
randomized gossip algorithms [Loizou and Richtarik, 2016], we interpret the
distributed nature of the proposed methods. We present novel protocols for
solving the average consensus problem where in each step all nodes of the
network update their values but only a subset of them exchange their private
values. Numerical experiments on popular wireless sensor networks showing the
benefits of our protocols are also presented.
Comment: 8 pages, 5 Figures, 56th Annual Allerton Conference on Communication,
Control, and Computing, 201
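A gossip protocol of this kind can be sketched as randomized pairwise averaging with a momentum term: in every step all nodes apply the momentum update, but only the two endpoints of one randomly chosen edge exchange values. The function name, the 5-node cycle graph, and the parameter values are assumptions of this sketch, not the protocols of the paper.

```python
import random

def gossip_with_momentum(values, edges, momentum=0.3, iters=3000, seed=0):
    """Average consensus by pairwise gossip with a heavy ball momentum term."""
    rng = random.Random(seed)
    x_prev = list(values)
    x = list(values)
    for _ in range(iters):
        i, j = rng.choice(edges)
        # Every node applies the momentum term using only its own history.
        x_next = [xk + momentum * (xk - pk) for xk, pk in zip(x, x_prev)]
        # Only nodes i and j exchange their current values and move to their average.
        half_gap = 0.5 * (x[i] - x[j])
        x_next[i] -= half_gap
        x_next[j] += half_gap
        x_prev, x = x, x_next
    return x

# Hypothetical network: a cycle on 5 nodes with initial values averaging 4.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
final_values = gossip_with_momentum([1.0, 2.0, 3.0, 4.0, 10.0], edges)
```

Both the pairwise exchange and the momentum term leave the sum of the node values unchanged, so the nodes can only agree on the average of the initial values.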
Coordinate Descent Face-Off: Primal or Dual?
Randomized coordinate descent (RCD) methods are state-of-the-art algorithms
for training linear predictors via minimizing regularized empirical risk. When
the number of examples ($n$) is much larger than the number of features ($d$),
a common strategy is to apply RCD to the dual problem. On the other hand, when
the number of features is much larger than the number of examples, it makes
sense to apply RCD directly to the primal problem. In this paper we provide the
first joint study of these two approaches when applied to L2-regularized ERM.
First, we show through a rigorous analysis that for dense data, the above
intuition is precisely correct. However, we find that for sparse and structured
data, primal RCD can significantly outperform dual RCD even if $d \ll n$, and
vice versa, dual RCD can be much faster than primal RCD even if $n \ll d$.
Moreover, we show that, surprisingly, a single sampling strategy minimizes both
the (bound on the) number of iterations and the overall expected complexity of
RCD. Note that the latter complexity measure also takes into account the
average cost of the iterations, which depends on the structure and sparsity of
the data, and on the sampling strategy employed. We confirm our theoretical
predictions using extensive experiments with both synthetic and real data sets.
Nonconvex Variance Reduced Optimization with Arbitrary Sampling
We provide the first importance sampling variants of variance reduced
algorithms for empirical risk minimization with non-convex loss functions. In
particular, we analyze non-convex versions of SVRG, SAGA and SARAH. Our methods
have the capacity to speed up the training process by an order of magnitude
compared to the state of the art on real datasets. Moreover, we also improve
upon current mini-batch analysis of these methods by proposing importance
sampling for minibatches in this setting. Surprisingly, our approach can in
some regimes lead to superlinear speedup with respect to the minibatch size,
which is not usually present in stochastic optimization. All the above results
follow from a general analysis of the methods which works with arbitrary
sampling, i.e., a fully general randomized strategy for the selection of subsets
of examples to be sampled in each iteration. Finally, we also perform a novel
importance sampling analysis of SARAH in the convex setting.
Comment: 9 pages, 12 figures, 25 pages of supplementary material
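The basic ingredient behind importance sampling is a reweighting that keeps the gradient estimate unbiased under any sampling probabilities. The sketch below shows only that ingredient for single-example sampling. The function names and the data are assumptions for illustration, and none of the variance-reduced methods named above are implemented here.

```python
def importance_weighted_gradient(grads, probs, i):
    """Gradient estimate from example i, reweighted by 1 / (n * probs[i])."""
    n = len(grads)
    return [g / (n * probs[i]) for g in grads[i]]

def expected_estimate(grads, probs):
    """Exact expectation of the estimate when example i is drawn with probability probs[i]."""
    out = [0.0] * len(grads[0])
    for i in range(len(grads)):
        est = importance_weighted_gradient(grads, probs, i)
        out = [o + probs[i] * e for o, e in zip(out, est)]
    return out

# Hypothetical per-example gradients and non-uniform sampling probabilities.
grads = [[1.0, 0.0], [0.0, 4.0], [3.0, 3.0]]
probs = [0.2, 0.5, 0.3]
mean_grad = [sum(g[k] for g in grads) / len(grads) for k in range(2)]
estimate_mean = expected_estimate(grads, probs)
```

Because the probabilities cancel in expectation, `estimate_mean` agrees with `mean_grad` up to rounding for any strictly positive `probs`. The choice of probabilities therefore affects only the variance of the estimate.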
One Method to Rule Them All: Variance Reduction for Data, Parameters and Many New Methods
We propose a remarkably general variance-reduced method suitable for solving
regularized empirical risk minimization problems with either a large number of
training examples, or a large model dimension, or both. In special cases, our
method reduces to several known and previously thought to be unrelated methods,
such as {\tt SAGA}, {\tt LSVRG}, {\tt JacSketch}, {\tt SEGA} and {\tt ISEGA},
and their arbitrary sampling and proximal generalizations. However, we also
highlight a large number of new specific algorithms with interesting
properties. We provide a single theorem establishing linear convergence of the
method under smoothness and quasi strong convexity assumptions. With this
theorem we recover best-known and sometimes improved rates for known methods
arising in special cases. As a by-product, we provide the first unified method
and theory for stochastic gradient and stochastic coordinate descent type
methods.
Comment: 61 pages, 6 figures, 3 tables
Stochastic Reformulations of Linear Systems: Algorithms and Convergence Theory
We develop a family of reformulations of an arbitrary consistent linear
system into a stochastic problem. The reformulations are governed by two
user-defined parameters: a positive definite matrix defining a norm, and an
arbitrary discrete or continuous distribution over random matrices. Our
reformulation has several equivalent interpretations, allowing researchers
from various communities to leverage their domain-specific insights. In
particular, our reformulation can be equivalently seen as a stochastic
optimization problem, stochastic linear system, stochastic fixed point problem
and a probabilistic intersection problem. We prove sufficient conditions, as
well as necessary and sufficient conditions, for the reformulation to be exact.
Further, we
propose and analyze three stochastic algorithms for solving the reformulated
problem---basic, parallel and accelerated methods---with global linear
convergence rates. The rates can be interpreted as condition numbers of a
matrix which depends on the system matrix and on the reformulation parameters.
This gives rise to a new phenomenon which we call stochastic preconditioning,
and which refers to the problem of finding parameters (matrix and distribution)
leading to a sufficiently small condition number. Our basic method can be
equivalently interpreted as stochastic gradient descent, stochastic Newton
method, stochastic proximal point method, stochastic fixed point method, and
stochastic projection method, with fixed stepsize (relaxation parameter),
applied to the reformulations.
Comment: Accepted to SIAM Journal on Matrix Analysis and Applications. This
arXiv version has an additional section (Section 6.2), listing several
extensions done since the paper was first written. Statistics: 39 pages, 4
reformulations, 3 algorithms
Randomized Quasi-Newton Updates are Linearly Convergent Matrix Inversion Algorithms
We develop and analyze a broad family of stochastic/randomized algorithms for
inverting a matrix. We also develop specialized variants maintaining symmetry
or positive definiteness of the iterates. All methods in the family converge
globally and linearly (i.e., the error decays exponentially), with explicit
rates. In special cases, we obtain stochastic block variants of several
quasi-Newton updates, including bad Broyden (BB), good Broyden (GB),
Powell-symmetric-Broyden (PSB), Davidon-Fletcher-Powell (DFP) and
Broyden-Fletcher-Goldfarb-Shanno (BFGS). Ours are the first stochastic versions
of these updates shown to converge to an inverse of a fixed matrix. Through a
dual viewpoint we uncover a fundamental link between quasi-Newton updates and
approximate inverse preconditioning. Further, we develop an adaptive variant of
randomized block BFGS (AdaRBFGS), where we modify the distribution underlying the
stochasticity of the method throughout the iterative process to achieve faster
convergence. By inverting several matrices from varied applications, we
demonstrate that AdaRBFGS is highly competitive when compared to the
well-established Newton-Schulz and minimal residual methods. In particular, on
large-scale problems our method outperforms the standard methods by orders of
magnitude. Efficient methods for estimating the inverse of very large matrices
are a much-needed tool for preconditioning and variable metric optimization
methods in the big data era.
Comment: 42 pages, 6 figures, 2 tables
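To give a feel for randomized matrix inversion, the sketch below uses a very simple row-projection iteration: at each step it picks a random row of the matrix and makes the smallest Frobenius-norm change to the current estimate that satisfies that row's equation in A X = I. This is an assumed illustrative scheme with a hypothetical function name. It is not one of the quasi-Newton variants (BFGS, DFP, PSB and so on) or the AdaRBFGS method described in the abstract.

```python
import random

def randomized_inverse(A, iters=500, seed=0):
    """Approximate the inverse of a nonsingular square matrix A by random row projections."""
    rng = random.Random(seed)
    n = len(A)
    X = [[0.0] * n for _ in range(n)]
    for _ in range(iters):
        i = rng.randrange(n)
        row = A[i]
        norm_sq = sum(a * a for a in row)
        # Residual of the i-th row equation: e_i^T - A[i] X.
        resid = [(1.0 if j == i else 0.0)
                 - sum(row[k] * X[k][j] for k in range(n))
                 for j in range(n)]
        # Smallest Frobenius-norm update of X that enforces A[i] X = e_i^T.
        for k in range(n):
            for j in range(n):
                X[k][j] += row[k] * resid[j] / norm_sq
    return X

# Hypothetical example matrix; its exact inverse is [[0.3, -0.1], [-0.2, 0.4]].
A = [[4.0, 1.0], [2.0, 3.0]]
X = randomized_inverse(A)
```

The iterates here are not kept symmetric or positive definite. Maintaining such structure is what the specialized variants mentioned in the abstract are designed for.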